Welcome to my R programming sample!
In this program, I look at the locations of voting poll sites in New York City relative to population density and determine whether population is a strong predictor of the number of voting poll sites in a census tract I then also incorporate a few demographic variables into the model to determine if there are any other predictors of the number of voting poll sites in a census tract.
Using open source NYC poll site data as well as census data, I will quality check, modify, merge, visualize, and model this data to answer my questions of 1) whether population is a statistically significant predictor of the number of voting poll sites in a census tract and 2) whether other demographic variables, such as race and income, are statistically significant predictors of the number of voting poll sites in a census tract when accounting for population.
library(tidyverse) # for general programming
library(tidycensus) # for importing census data
library(tidygeocoder) # for geocoding addresses
library(sf) # for manipulating spatial data
library(tmap) # for creating maps
# Voting poll site data
pollsites <- read_csv("Voting_Poll_Sites.csv")
pollsites %>% glimpse()
## Rows: 1,231
## Columns: 20
## $ BOROUGH <chr> "BROOKLYN", NA, "QUEENS", "BROOKLYN", "BROOKLYN", "…
## $ SITE_STATUS <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "…
## $ SITE_NAME <chr> "PS 12", "Wyatt T. Walker Senior Housing", "Allen A…
## $ SITE_NUMBER <chr> "11537", "11517", "10590", "11493", "11629", "11620…
## $ STREET_NUMBER <chr> "430", "2177", "112-04", "300", "495", "549", "65-1…
## $ STREET_SUFFIX <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ STREET_NAME <chr> "Howard Avenue", "Frederick Douglass Boulevard", "1…
## $ POSTCODE <chr> "11233", "10026", "11433", "11223", "11216", "10040…
## $ CITY <chr> "Brooklyn", "New York", "Jamaica", "Brooklyn", "Bro…
## $ VOTER_ENTRANCE <chr> "430 Howard Avenue (not used for voters)", "2177 Fr…
## $ HANDICAP_ENTRANCE <chr> "Enter on Prospect Place through school yard", "217…
## $ Latitude <dbl> NA, NA, 40.69148, NA, NA, NA, 40.73694, NA, 40.6713…
## $ Longitude <dbl> NA, NA, -73.78333, NA, NA, NA, -73.81325, NA, -73.9…
## $ `Community Board` <dbl> 16, NA, 12, 15, 3, 12, 8, 9, 8, 11, 13, 3, 7, 2, 6,…
## $ `Council District` <dbl> 41, NA, 27, 47, 36, 10, 24, 29, 35, 8, 23, 17, 39, …
## $ `Census Tract` <dbl> 363, NA, 266, 37402, 265, 277, 122702, 216, 339, 17…
## $ BIN <dbl> 3039174, NA, 4435291, 3195077, 3050974, 1076751, 41…
## $ BBL <dbl> 3014570032, NA, 4123220001, 3071940001, 3018090062,…
## $ NTA <chr> "Ocean Hill", NA, "South Jamaica", "Gravesend", "Be…
## $ Location <chr> NA, NA, "(40.691478, -73.783335)", NA, NA, NA, "(40…
# Census population data (census tract level)
nyc_pop <- tidycensus::get_acs(geography = "tract", variables = "B01003_001", state = "NY", county = c("New York", "Kings", "Queens", "Bronx", "Richmond"), geometry = TRUE)
## | | | 0% | |= | 1% | |= | 2% | |== | 3% | |=== | 4% | |==== | 5% | |==== | 6% | |===== | 7% | |======= | 10% | |======== | 12% | |========= | 13% | |========== | 14% | |=========== | 16% | |============ | 17% | |============ | 18% | |============= | 19% | |============== | 19% | |================ | 23% | |================= | 25% | |================== | 26% | |==================== | 29% | |===================== | 30% | |===================== | 31% | |======================= | 32% | |======================== | 35% | |========================= | 36% | |========================== | 37% | |=========================== | 38% | |=========================== | 39% | |============================ | 40% | |============================ | 41% | |============================= | 42% | |============================== | 42% | |============================== | 43% | |=============================== | 44% | |=============================== | 45% | |================================ | 46% | |================================= | 47% | |================================== | 48% | |================================== | 49% | |=================================== | 50% | |==================================== | 51% | |==================================== | 52% | |===================================== | 53% | |====================================== | 54% | |======================================= | 55% | |======================================= | 56% | |======================================== | 57% | |======================================== | 58% | |========================================= | 59% | |========================================== | 59% | |========================================== | 60% | |=========================================== | 61% | |=========================================== | 62% | |============================================ | 63% | |============================================= | 64% | |============================================== | 65% | |============================================== | 66% | |=============================================== | 67% | |================================================ | 68% | |================================================ | 69% | |================================================= | 70% | |================================================== | 71% | |================================================== | 72% | |=================================================== | 73% | |==================================================== | 74% | |==================================================== | 75% | |===================================================== | 76% | |====================================================== | 77% | |======================================================= | 78% | |======================================================= | 79% | |======================================================== | 80% | |======================================================== | 81% | |========================================================= | 81% | |========================================================== | 82% | |========================================================== | 83% | |=========================================================== | 84% | |=========================================================== | 85% | |============================================================ | 86% | |============================================================= | 87% | |============================================================== | 88% | |============================================================== | 89% | |=============================================================== | 90% | |================================================================ | 91% | |================================================================ | 92% | |================================================================= | 92% | |================================================================= | 93% | |================================================================== | 94% | |=================================================================== | 95% | |=================================================================== | 96% | |==================================================================== | 97% | |==================================================================== | 98% | |===================================================================== | 98% | |======================================================================| 99% | |======================================================================| 100%
nyc_pop %>% glimpse()
## Rows: 2,327
## Columns: 6
## $ GEOID <chr> "36005023502", "36005013500", "36005009200", "36005005400", "…
## $ NAME <chr> "Census Tract 235.02; Bronx County; New York", "Census Tract …
## $ variable <chr> "B01003_001", "B01003_001", "B01003_001", "B01003_001", "B010…
## $ estimate <dbl> 4284, 3295, 5675, 5306, 4721, 2303, 6444, 1022, 3281, 1901, 7…
## $ moe <dbl> 896, 699, 696, 834, 765, 368, 873, 223, 551, 414, 1328, 474, …
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-73.906 40...., MULTIPOLYGON (((…
# Other census demographic variables (census tract level)
nyc_demo <- tidycensus::get_acs(geography = "tract", variables = c(
"B17026_001", # ratio of income to poverty level
"B03001_003", # ethnicity: hispanic or latino
"B03002_003", # race: white alone, not hispanic or latino
"B03002_004", # race: black or african american alone, not hispanic or latino
"B03002_005", # race: native american alone, not hispanic or latino
"B03002_006", # race: asian alone, not hispanic or latino
"B03002_007", # race: native hawaiian or pacific islander alone, not hispanic or latino
"B03002_008", # race: other race alone, not hispanic or latino
"B03002_009" # race: two or more races, not hispanic or latino
), state = "NY", county = c("New York", "Kings", "Queens", "Bronx", "Richmond"))
nyc_demo %>% glimpse()
## Rows: 20,943
## Columns: 5
## $ GEOID <chr> "36005000100", "36005000100", "36005000100", "36005000100", "…
## $ NAME <chr> "Census Tract 1; Bronx County; New York", "Census Tract 1; Br…
## $ variable <chr> "B03001_003", "B03002_003", "B03002_004", "B03002_005", "B030…
## $ estimate <dbl> 866, 958, 1545, 10, 79, 0, 17, 63, 0, 3198, 77, 1517, 0, 311,…
## $ moe <dbl> 287, 765, 409, 18, 59, 13, 22, 231, 13, 556, 98, 484, 19, 356…
This dataset from NYC Open Data contains geographical information about each of the voting poll sites in New York City. In addition to standard data quality checks, goal here is to assess if there is any missingness that would impact the merge to the census data
# Check for full duplicates
pollsites %>%
group_by_all() %>%
filter(n()>1) # 0 rows
## # A tibble: 0 × 20
## # Groups: BOROUGH, SITE_STATUS, SITE_NAME, SITE_NUMBER, STREET_NUMBER,
## # STREET_SUFFIX, STREET_NAME, POSTCODE, CITY, VOTER_ENTRANCE,
## # HANDICAP_ENTRANCE, Latitude, Longitude, Community Board, Council District,
## # Census Tract, BIN, BBL, NTA, Location [0]
## # ℹ 20 variables: BOROUGH <chr>, SITE_STATUS <chr>, SITE_NAME <chr>,
## # SITE_NUMBER <chr>, STREET_NUMBER <chr>, STREET_SUFFIX <lgl>,
## # STREET_NAME <chr>, POSTCODE <chr>, CITY <chr>, VOTER_ENTRANCE <chr>,
## # HANDICAP_ENTRANCE <chr>, Latitude <dbl>, Longitude <dbl>,
## # Community Board <dbl>, Council District <dbl>, Census Tract <dbl>,
## # BIN <dbl>, BBL <dbl>, NTA <chr>, Location <chr>
# Check for partial duplicates on site number
pollsites %>%
group_by(SITE_NUMBER) %>%
filter(n()>1) # 0 rows
## # A tibble: 0 × 20
## # Groups: SITE_NUMBER [0]
## # ℹ 20 variables: BOROUGH <chr>, SITE_STATUS <chr>, SITE_NAME <chr>,
## # SITE_NUMBER <chr>, STREET_NUMBER <chr>, STREET_SUFFIX <lgl>,
## # STREET_NAME <chr>, POSTCODE <chr>, CITY <chr>, VOTER_ENTRANCE <chr>,
## # HANDICAP_ENTRANCE <chr>, Latitude <dbl>, Longitude <dbl>,
## # Community Board <dbl>, Council District <dbl>, Census Tract <dbl>,
## # BIN <dbl>, BBL <dbl>, NTA <chr>, Location <chr>
# Check for missingness across all columns
colSums(is.na(pollsites))
## BOROUGH SITE_STATUS SITE_NAME SITE_NUMBER
## 3 0 0 0
## STREET_NUMBER STREET_SUFFIX STREET_NAME POSTCODE
## 0 1231 0 0
## CITY VOTER_ENTRANCE HANDICAP_ENTRANCE Latitude
## 0 1 0 124
## Longitude Community Board Council District Census Tract
## 124 3 3 3
## BIN BBL NTA Location
## 11 11 3 124
Since the census tract variable is not in the same format as the one in the census population data, I will be merging these two datasets using their spatial geometry. However, 134 rows are missing a latitude and a longitude, so I will geocode them to obtain their coordinates later on.
# Check for full duplicates
nyc_pop %>%
group_by_all() %>%
filter(n()>1) # 0 rows
## Simple feature collection with 0 features and 5 fields
## Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
## Geodetic CRS: NAD83
## # A tibble: 0 × 6
## # Groups: GEOID, NAME, variable, estimate, moe, geometry [0]
## # ℹ 6 variables: GEOID <chr>, NAME <chr>, variable <chr>, estimate <dbl>,
## # moe <dbl>, geometry <GEOMETRY [°]>
# Check for partial duplicates on census tract
nyc_pop %>%
group_by(GEOID) %>%
filter(n()>1) # 0 rows
## Simple feature collection with 0 features and 5 fields
## Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
## Geodetic CRS: NAD83
## # A tibble: 0 × 6
## # Groups: GEOID [0]
## # ℹ 6 variables: GEOID <chr>, NAME <chr>, variable <chr>, estimate <dbl>,
## # moe <dbl>, geometry <GEOMETRY [°]>
# Check for missingness across all columns
colSums(is.na(nyc_pop)) # no missing values
## GEOID NAME variable estimate moe geometry
## 0 0 0 0 0 0
Thankfully, the census population data appears to be already clean.
# Check for full duplicates
nyc_demo %>%
group_by_all() %>%
filter(n()>1) # 0 rows
## # A tibble: 0 × 5
## # Groups: GEOID, NAME, variable, estimate, moe [0]
## # ℹ 5 variables: GEOID <chr>, NAME <chr>, variable <chr>, estimate <dbl>,
## # moe <dbl>
# Check for partial duplicates on census tract and variable
nyc_demo %>%
group_by(GEOID, variable) %>%
filter(n()>1) # 0 rows
## # A tibble: 0 × 5
## # Groups: GEOID, variable [0]
## # ℹ 5 variables: GEOID <chr>, NAME <chr>, variable <chr>, estimate <dbl>,
## # moe <dbl>
# Check for missingness across all columns
colSums(is.na(nyc_demo)) # no missing values
## GEOID NAME variable estimate moe
## 0 0 0 0 0
In order to geocode the 124 poll site rows that are missing coordinates, I first have to create a single address variable to feed into the function. First, I need to fix the street name variable so that it has the appropriate suffixes after any numbers (ex: 144th St instead of 144 St).
# Isolate rows that need to be geocoded
to_geocode <- pollsites %>% filter(is.na(Latitude) & is.na(Longitude))
# Fix street name variable
to_geocode2 <- to_geocode %>% mutate(street_new = str_replace_all(
STREET_NAME,
"(\\b\\d+\\b)(?!(st|nd|rd|th))", # use regular expressions to identify numbers in the street name variable that do not already have a suffix
function(x) {
num <- as.integer(x) # extract the number from these rows
suffix <- if (num %% 100 >= 11 && num %% 100 <= 13) { # address outliers 11 and 13 which end in "th" instead of "st" or "rd"
"th"
} else {
switch(as.character(num %% 10), # extract the last digit of the number to assign it the proper suffix
"1" = "st",
"2" = "nd",
"3" = "rd",
"th")
}
paste0(num, suffix) # add the suffix to the number
}
))
# Check that the new street name variable looks correct
to_geocode2 %>%
filter(str_detect(street_new, "\\d")) %>% # only look at rows that have a number in the street name
count(street_new) # print all instances of this
## # A tibble: 46 × 2
## street_new n
## <chr> <int>
## 1 101st Avenue 1
## 2 104th Street 1
## 3 13th Avenue 1
## 4 13th Street 1
## 5 164th Street 1
## 6 18th Avenue 1
## 7 2nd Avenue 1
## 8 44th Avenue 1
## 9 4th Avenue 2
## 10 51st Avenue 1
## # ℹ 36 more rows
# Create singular address variable
to_geocode3 <- to_geocode2 %>% mutate(address = paste0(STREET_NUMBER, " ", street_new, ", ", CITY, ", ", "NY ", POSTCODE))
# Check that address looks correct
to_geocode3 %>% select(STREET_NUMBER, street_new, CITY, POSTCODE, address)
## # A tibble: 124 × 5
## STREET_NUMBER street_new CITY POSTCODE address
## <chr> <chr> <chr> <chr> <chr>
## 1 430 Howard Avenue Brooklyn 11233 430 Howard A…
## 2 2177 Frederick Douglass Boulevard New York 10026 2177 Frederi…
## 3 300 Avenue X Brooklyn 11223 300 Avenue X…
## 4 495 Gates Avenue Brooklyn 11216 495 Gates Av…
## 5 549 Audubon Avenue New York 10040 549 Audubon …
## 6 127-15 Kew Gardens Road Kew Gardens 11415 127-15 Kew G…
## 7 1716 Southern Boulevard Bronx 10460 1716 Souther…
## 8 270 West 89th Street New York 10024 270 West 89t…
## 9 137 Jamaica Avenue Brooklyn 11207 137 Jamaica …
## 10 9941 Fort Hamilton Parkway Brooklyn 11209 9941 Fort Ha…
## # ℹ 114 more rows
# Geocode
geocoded_addresses <- to_geocode3 %>% geocode(address)
# Check that all addresses were geocoded
geocoded_addresses %>% filter(is.na(lat) | is.na(long)) %>% select(address, lat, long)
## # A tibble: 3 × 3
## address lat long
## <chr> <dbl> <dbl>
## 1 110-04 Atlantic Avenue, South Richmond Hill, NY 11419 NA NA
## 2 105-25 Horace Harding Expressway North, Corona, NY 11368 NA NA
## 3 71-50 Parsons Boulevard, Fresh Meadows, NY 11365 NA NA
There are three addresses that didn’t get geocoded. I’m going to manually edit these addresses and try again
# Isolate rows that didn't get geocoded
not_geo <- geocoded_addresses %>% filter(is.na(lat) | is.na(long)) %>%
select(-c(lat, long)) # drop lat and long variables to avoid duplication later
# Fix their addresses
not_geo <- not_geo %>% mutate(address = case_when(
address == "110-04 Atlantic Avenue, South Richmond Hill, NY 11419" ~ "110-04 Atlantic Avenue, Richmond Hill, NY 11419",
address == "105-25 Horace Harding Expressway North, Corona, NY 11368" ~ "105-25 Horace Harding Expy, Corona, NY 11368",
address == "71-50 Parsons Boulevard, Fresh Meadows, NY 11365" ~ "71-50 Parsons Blvd, Flushing, NY 11365"
))
# Geocode
not_geo_geocoded <- not_geo %>% geocode(address)
# Check that all addresses were geocoded
not_geo_geocoded %>% filter(is.na(lat) | is.na(long)) %>% select(address, lat, long)
## # A tibble: 0 × 3
## # ℹ 3 variables: address <chr>, lat <dbl>, long <dbl>
Now all rows are geocoded! I’m going to stack the two sets of geocoded addresses and then add them back into the original dataset.
# Remove 3 non-geocoded rows from original geocoded address dataset
geocoded_addresses2 <- geocoded_addresses %>% filter(!is.na(lat) & !is.na(long))
# Check that only 3 rows are missing
isTRUE(geocoded_addresses2 %>% nrow() == geocoded_addresses %>% nrow() - 3)
## [1] TRUE
# Stack geocoded addresses
all_geocoded_addresses <- geocoded_addresses2 %>% rbind(not_geo_geocoded)
# Check row counts
isTRUE(all_geocoded_addresses %>% nrow() == geocoded_addresses2 %>% nrow() + 3)
## [1] TRUE
isTRUE(all_geocoded_addresses %>% nrow() == geocoded_addresses %>% nrow())
## [1] TRUE
# Make columns match original dataset
all_geocoded_addresses2 <- all_geocoded_addresses %>% mutate(Latitude = lat, Longitude = long) %>% select(-c(address, lat, long, street_new))
# Remove these rows from original dataset
pollsites2 <- pollsites %>% filter(!is.na(Latitude) & !is.na(Longitude))
# Check that exactly 124 rows are missing
isTRUE(pollsites2 %>% nrow() == pollsites %>% nrow() - 124)
## [1] TRUE
# Add geocoded rows back to original dataset
pollsites3 <- pollsites2 %>% rbind(all_geocoded_addresses2)
# Check row counts
isTRUE(pollsites3 %>% nrow() == pollsites2 %>% nrow() + 124)
## [1] TRUE
isTRUE(pollsites3 %>% nrow() == pollsites %>% nrow())
## [1] TRUE
# Check that all rows have a latitude and longitude
pollsites3 %>% filter(is.na(Latitude) | is.na(Longitude))
## # A tibble: 0 × 20
## # ℹ 20 variables: BOROUGH <chr>, SITE_STATUS <chr>, SITE_NAME <chr>,
## # SITE_NUMBER <chr>, STREET_NUMBER <chr>, STREET_SUFFIX <lgl>,
## # STREET_NAME <chr>, POSTCODE <chr>, CITY <chr>, VOTER_ENTRANCE <chr>,
## # HANDICAP_ENTRANCE <chr>, Latitude <dbl>, Longitude <dbl>,
## # Community Board <dbl>, Council District <dbl>, Census Tract <dbl>,
## # BIN <dbl>, BBL <dbl>, NTA <chr>, Location <chr>
I also need to prepare the demographic census data I loaded for a merge later on.
# Income to poverty ratio
nyc_income <- nyc_demo %>% filter(variable=="B17026_001") %>% mutate(Variable = "Income to poverty ratio") %>% rename(`Income to poverty ratio` = estimate)
# Check that only one variable remains
nyc_income %>% count(variable, Variable)
## # A tibble: 1 × 3
## variable Variable n
## <chr> <chr> <int>
## 1 B17026_001 Income to poverty ratio 2327
# Race
nyc_race <- nyc_demo %>% filter(variable!="B17026_001") %>%
mutate(Race = case_when( # Create labels for race variables
variable=="B03001_003" ~ "Hispanic or Latino of any race",
variable=="B03002_003" ~ "White alone, not Hispanic or Latino",
variable=="B03002_004" ~ "Black or African American alone, not Hispanic or Latino",
variable=="B03002_005" ~ "Native American alone, not Hispanic or Latino",
variable=="B03002_006" ~ "Asian alone, not Hispanic or Latino",
variable=="B03002_007" ~ "Native Hawaiian or Pacific Islander alone, not Hispanic or Latino",
variable=="B03002_008" ~ "Other race alone, not Hispanic or Latino",
variable=="B03002_009" ~ "Two or more races, not Hispanic or Latino",
)) %>%
group_by(GEOID) %>% mutate(Total_pop = sum(estimate)) %>% # create a total population for each census tract
ungroup() %>% mutate(Pct_pop = estimate/Total_pop*100) # create percentages of each race for each census tract
# Check creation of new race variable
nyc_race %>% count(Race, variable)
## # A tibble: 8 × 3
## Race variable n
## <chr> <chr> <int>
## 1 Asian alone, not Hispanic or Latino B03002_… 2327
## 2 Black or African American alone, not Hispanic or Latino B03002_… 2327
## 3 Hispanic or Latino of any race B03001_… 2327
## 4 Native American alone, not Hispanic or Latino B03002_… 2327
## 5 Native Hawaiian or Pacific Islander alone, not Hispanic or Lat… B03002_… 2327
## 6 Other race alone, not Hispanic or Latino B03002_… 2327
## 7 Two or more races, not Hispanic or Latino B03002_… 2327
## 8 White alone, not Hispanic or Latino B03002_… 2327
# Pivot dataset
nyc_race2 <- nyc_race %>% select(-c(variable, moe)) %>% pivot_wider(names_from = Race, values_from = c(estimate, Pct_pop))
# Check creation of total population and percent variables
nyc_race2 %>% filter(Total_pop!=`estimate_Hispanic or Latino of any race`+`estimate_White alone, not Hispanic or Latino`+`estimate_Black or African American alone, not Hispanic or Latino`+`estimate_Native American alone, not Hispanic or Latino`+`estimate_Asian alone, not Hispanic or Latino`+`estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino`+`estimate_Other race alone, not Hispanic or Latino`+`estimate_Two or more races, not Hispanic or Latino`)
## # A tibble: 0 × 19
## # ℹ 19 variables: GEOID <chr>, NAME <chr>, Total_pop <dbl>,
## # estimate_Hispanic or Latino of any race <dbl>,
## # estimate_White alone, not Hispanic or Latino <dbl>,
## # estimate_Black or African American alone, not Hispanic or Latino <dbl>,
## # estimate_Native American alone, not Hispanic or Latino <dbl>,
## # estimate_Asian alone, not Hispanic or Latino <dbl>,
## # estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino <dbl>, …
nyc_race2 %>% filter(`Pct_pop_Hispanic or Latino of any race`!= `estimate_Hispanic or Latino of any race`/Total_pop*100)
## # A tibble: 0 × 19
## # ℹ 19 variables: GEOID <chr>, NAME <chr>, Total_pop <dbl>,
## # estimate_Hispanic or Latino of any race <dbl>,
## # estimate_White alone, not Hispanic or Latino <dbl>,
## # estimate_Black or African American alone, not Hispanic or Latino <dbl>,
## # estimate_Native American alone, not Hispanic or Latino <dbl>,
## # estimate_Asian alone, not Hispanic or Latino <dbl>,
## # estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino <dbl>, …
nyc_race2 %>% filter(`Pct_pop_White alone, not Hispanic or Latino` != `estimate_White alone, not Hispanic or Latino`/Total_pop*100)
## # A tibble: 0 × 19
## # ℹ 19 variables: GEOID <chr>, NAME <chr>, Total_pop <dbl>,
## # estimate_Hispanic or Latino of any race <dbl>,
## # estimate_White alone, not Hispanic or Latino <dbl>,
## # estimate_Black or African American alone, not Hispanic or Latino <dbl>,
## # estimate_Native American alone, not Hispanic or Latino <dbl>,
## # estimate_Asian alone, not Hispanic or Latino <dbl>,
## # estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino <dbl>, …
nyc_race2 %>% filter(`Pct_pop_Black or African American alone, not Hispanic or Latino` != `estimate_Black or African American alone, not Hispanic or Latino`/Total_pop*100)
## # A tibble: 0 × 19
## # ℹ 19 variables: GEOID <chr>, NAME <chr>, Total_pop <dbl>,
## # estimate_Hispanic or Latino of any race <dbl>,
## # estimate_White alone, not Hispanic or Latino <dbl>,
## # estimate_Black or African American alone, not Hispanic or Latino <dbl>,
## # estimate_Native American alone, not Hispanic or Latino <dbl>,
## # estimate_Asian alone, not Hispanic or Latino <dbl>,
## # estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino <dbl>, …
nyc_race2 %>% filter(`Pct_pop_Native American alone, not Hispanic or Latino` != `estimate_Native American alone, not Hispanic or Latino`/Total_pop*100)
## # A tibble: 0 × 19
## # ℹ 19 variables: GEOID <chr>, NAME <chr>, Total_pop <dbl>,
## # estimate_Hispanic or Latino of any race <dbl>,
## # estimate_White alone, not Hispanic or Latino <dbl>,
## # estimate_Black or African American alone, not Hispanic or Latino <dbl>,
## # estimate_Native American alone, not Hispanic or Latino <dbl>,
## # estimate_Asian alone, not Hispanic or Latino <dbl>,
## # estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino <dbl>, …
nyc_race2 %>% filter(`Pct_pop_Asian alone, not Hispanic or Latino` != `estimate_Asian alone, not Hispanic or Latino`/Total_pop*100)
## # A tibble: 0 × 19
## # ℹ 19 variables: GEOID <chr>, NAME <chr>, Total_pop <dbl>,
## # estimate_Hispanic or Latino of any race <dbl>,
## # estimate_White alone, not Hispanic or Latino <dbl>,
## # estimate_Black or African American alone, not Hispanic or Latino <dbl>,
## # estimate_Native American alone, not Hispanic or Latino <dbl>,
## # estimate_Asian alone, not Hispanic or Latino <dbl>,
## # estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino <dbl>, …
nyc_race2 %>% filter(`Pct_pop_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino` != `estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino`/Total_pop*100)
## # A tibble: 0 × 19
## # ℹ 19 variables: GEOID <chr>, NAME <chr>, Total_pop <dbl>,
## # estimate_Hispanic or Latino of any race <dbl>,
## # estimate_White alone, not Hispanic or Latino <dbl>,
## # estimate_Black or African American alone, not Hispanic or Latino <dbl>,
## # estimate_Native American alone, not Hispanic or Latino <dbl>,
## # estimate_Asian alone, not Hispanic or Latino <dbl>,
## # estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino <dbl>, …
nyc_race2 %>% filter(`Pct_pop_Other race alone, not Hispanic or Latino` != `estimate_Other race alone, not Hispanic or Latino`/Total_pop*100)
## # A tibble: 0 × 19
## # ℹ 19 variables: GEOID <chr>, NAME <chr>, Total_pop <dbl>,
## # estimate_Hispanic or Latino of any race <dbl>,
## # estimate_White alone, not Hispanic or Latino <dbl>,
## # estimate_Black or African American alone, not Hispanic or Latino <dbl>,
## # estimate_Native American alone, not Hispanic or Latino <dbl>,
## # estimate_Asian alone, not Hispanic or Latino <dbl>,
## # estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino <dbl>, …
nyc_race2 %>% filter(`Pct_pop_Two or more races, not Hispanic or Latino` != `estimate_Two or more races, not Hispanic or Latino`/Total_pop*100)
## # A tibble: 0 × 19
## # ℹ 19 variables: GEOID <chr>, NAME <chr>, Total_pop <dbl>,
## # estimate_Hispanic or Latino of any race <dbl>,
## # estimate_White alone, not Hispanic or Latino <dbl>,
## # estimate_Black or African American alone, not Hispanic or Latino <dbl>,
## # estimate_Native American alone, not Hispanic or Latino <dbl>,
## # estimate_Asian alone, not Hispanic or Latino <dbl>,
## # estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino <dbl>, …
I will now merge the poll site data to the census population data using spatial geometry.
# Convert coordinates to a spatial object
pollsites_sf <- st_as_sf(pollsites3, coords = c("Longitude", "Latitude"), crs = st_crs(nyc_pop))
# Create "in" variables to check merge later on
pollsites_sf2 <- pollsites_sf %>% mutate(inPOLL = 1)
nyc_pop2 <- nyc_pop %>% mutate(inCENSUS = 1)
# Spatially join the two datasets
merged_df <- st_join(nyc_pop2, pollsites_sf2, left = TRUE)
# Check merge
merged_df %>% count(inPOLL, inCENSUS)
## Simple feature collection with 2 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -74.25563 ymin: 40.4961 xmax: -73.70036 ymax: 40.91771
## Geodetic CRS: NAD83
## inPOLL inCENSUS n geometry
## 1 1 1 1231 MULTIPOLYGON (((-73.81123 4...
## 2 NA 1 1345 MULTIPOLYGON (((-73.9432 40...
All rows were merged successfully.
Now, I will merge this data to the demographic datasets I created earlier to use later on.
# Create "in" variables to check merge later on
nyc_race3 <- nyc_race2 %>% mutate(inRACE = 1)
nyc_income2 <- nyc_income %>% mutate(inINCOME = 1)
merged_df2 <- merged_df %>% mutate(inMERGE = 1)
# Merge
merged_df3 <- merged_df2 %>% full_join(nyc_race3, by = "GEOID")
# Check first merge
merged_df3 %>% count(inMERGE, inRACE)
## Simple feature collection with 1 feature and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -74.25563 ymin: 40.4961 xmax: -73.70036 ymax: 40.91771
## Geodetic CRS: NAD83
## inMERGE inRACE n geometry
## 1 1 1 2576 MULTIPOLYGON (((-74.21211 4...
# Merge again
merged_df4 <- merged_df3 %>% full_join(nyc_income2, by = "GEOID")
# Check final merge
merged_df4 %>% count(inMERGE, inRACE, inINCOME)
## Simple feature collection with 1 feature and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -74.25563 ymin: 40.4961 xmax: -73.70036 ymax: 40.91771
## Geodetic CRS: NAD83
## inMERGE inRACE inINCOME n geometry
## 1 1 1 1 2576 MULTIPOLYGON (((-74.21211 4...
Now that the data is all cleaned and merged, I’m going to visualize the poll site locations relative to population density.
I will visualize the poll site locations relative to population density one borough at a time.
tmap_mode("view")
manhattan_map <- tm_shape(nyc_pop %>% filter(grepl("New York County", NAME))) + tm_fill(fill = "estimate", fill.scale = tm_scale(breaks = c(0, 1000, 2000, 3000, 5000, 10000, 15000, 20000)), fill_alpha = 0.7) + tm_shape(pollsites_sf %>% filter(BOROUGH=="MANHATTAN")) + tm_dots(fill = "navy")
manhattan_map
brooklyn_map <- tm_shape(nyc_pop %>% filter(grepl("Kings County", NAME))) + tm_fill(fill = "estimate", fill.scale = tm_scale(breaks = c(0, 1000, 2000, 3000, 5000, 10000, 15000, 20000)), fill_alpha = 0.7) + tm_shape(pollsites_sf %>% filter(BOROUGH=="BROOKLYN")) + tm_dots(fill = "navy")
brooklyn_map
queens_map <- tm_shape(nyc_pop %>% filter(grepl("Queens County", NAME))) + tm_fill(fill = "estimate", fill.scale = tm_scale(breaks = c(0, 1000, 2000, 3000, 5000, 10000, 15000, 20000)), fill_alpha = 0.7) + tm_shape(pollsites_sf %>% filter(BOROUGH=="QUEENS")) + tm_dots(fill = "navy")
queens_map
bronx_map <- tm_shape(nyc_pop %>% filter(grepl("Bronx County", NAME))) + tm_fill(fill = "estimate", fill.scale = tm_scale(breaks = c(0, 1000, 2000, 3000, 5000, 10000, 15000, 20000)), fill_alpha = 0.7) + tm_shape(pollsites_sf %>% filter(BOROUGH=="BRONX")) + tm_dots(fill = "navy")
bronx_map
si_map <- tm_shape(nyc_pop %>% filter(grepl("Richmond County", NAME))) + tm_fill(fill = "estimate", fill.scale = tm_scale(breaks = c(0, 1000, 2000, 3000, 5000, 10000, 15000, 20000)), fill_alpha = 0.7) + tm_shape(pollsites_sf %>% filter(BOROUGH=="STATEN IS")) + tm_dots(fill = "navy")
si_map
It looks like poll sites are pretty evenly distributed in terms of population density, but I’m now going to model the data to make sure that is actually true.
I am going to be running a linear model on the merged data to test my hypothesis that population is a predictor of the number of poll sites in a census tract.
First, I need to prepare the data for the model by identifying how many poll sites are in each census tract.
# Create new dataset with number of poll sites by census tract
census_tract <- merged_df4 %>% group_by(GEOID) %>% mutate(n_pollsite = sum(!is.na(SITE_NAME)))
# Check that there are 2327 unique census tracts
census_tract %>% distinct(GEOID) %>% nrow()
## [1] 2327
# Check that poll site numbers are within reason
max(census_tract$n_pollsite)
## [1] 5
min(census_tract$n_pollsite)
## [1] 0
# Collapse to 1 row per census tract
census_tract2 <- census_tract %>% select(estimate, GEOID, n_pollsite, `Income to poverty ratio`, `Pct_pop_Hispanic or Latino of any race`, `Pct_pop_White alone, not Hispanic or Latino`, `Pct_pop_Black or African American alone, not Hispanic or Latino`, `Pct_pop_Native American alone, not Hispanic or Latino`, `Pct_pop_Asian alone, not Hispanic or Latino`, `Pct_pop_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino`, `Pct_pop_Other race alone, not Hispanic or Latino`, `Pct_pop_Two or more races, not Hispanic or Latino`) %>% distinct()
census_tract2 %>% nrow() # should be 2327
## [1] 2327
Next, I will run the model.
population_model <- lm(n_pollsite ~ estimate, data = census_tract2)
population_model %>% summary()
##
## Call:
## lm(formula = n_pollsite ~ estimate, data = census_tract2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7365 -0.4799 -0.2718 0.4444 4.2715
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.010e-01 2.885e-02 3.501 0.000472 ***
## estimate 1.169e-04 6.899e-06 16.952 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6737 on 2325 degrees of freedom
## Multiple R-squared: 0.11, Adjusted R-squared: 0.1096
## F-statistic: 287.4 on 1 and 2325 DF, p-value: < 2.2e-16
Population is a statistically significant predictor of the number of poll sites in a neighborhood. I am also going to add a few other demographic variables to the model to see if there are any confounding variables, or any other predictors of the number of poll sites.
population_demo_model <- lm(n_pollsite ~ estimate + `Income to poverty ratio` + `Pct_pop_Hispanic or Latino of any race` + `Pct_pop_White alone, not Hispanic or Latino` + `Pct_pop_Black or African American alone, not Hispanic or Latino` + `Pct_pop_Native American alone, not Hispanic or Latino` + `Pct_pop_Asian alone, not Hispanic or Latino` + `Pct_pop_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino` + `Pct_pop_Other race alone, not Hispanic or Latino` + `Pct_pop_Two or more races, not Hispanic or Latino`, data = census_tract2)
population_demo_model %>% summary()
##
## Call:
## lm(formula = n_pollsite ~ estimate + `Income to poverty ratio` +
## `Pct_pop_Hispanic or Latino of any race` + `Pct_pop_White alone, not Hispanic or Latino` +
## `Pct_pop_Black or African American alone, not Hispanic or Latino` +
## `Pct_pop_Native American alone, not Hispanic or Latino` +
## `Pct_pop_Asian alone, not Hispanic or Latino` + `Pct_pop_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino` +
## `Pct_pop_Other race alone, not Hispanic or Latino` + `Pct_pop_Two or more races, not Hispanic or Latino`,
## data = census_tract2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7323 -0.4892 -0.2842 0.4514 4.2325
##
## Coefficients: (1 not defined because of singularities)
## Estimate
## (Intercept) -2.829e-02
## estimate 1.179e-04
## `Income to poverty ratio` -1.700e-05
## `Pct_pop_Hispanic or Latino of any race` 1.661e-03
## `Pct_pop_White alone, not Hispanic or Latino` 1.948e-03
## `Pct_pop_Black or African American alone, not Hispanic or Latino` 1.844e-03
## `Pct_pop_Native American alone, not Hispanic or Latino` -4.915e-03
## `Pct_pop_Asian alone, not Hispanic or Latino` 5.772e-05
## `Pct_pop_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino` 5.691e-03
## `Pct_pop_Other race alone, not Hispanic or Latino` -3.026e-03
## `Pct_pop_Two or more races, not Hispanic or Latino` NA
## Std. Error
## (Intercept) 4.239e-01
## estimate 2.005e-05
## `Income to poverty ratio` 8.654e-05
## `Pct_pop_Hispanic or Latino of any race` 4.283e-03
## `Pct_pop_White alone, not Hispanic or Latino` 4.417e-03
## `Pct_pop_Black or African American alone, not Hispanic or Latino` 4.498e-03
## `Pct_pop_Native American alone, not Hispanic or Latino` 1.706e-02
## `Pct_pop_Asian alone, not Hispanic or Latino` 4.450e-03
## `Pct_pop_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino` 1.880e-02
## `Pct_pop_Other race alone, not Hispanic or Latino` 6.999e-03
## `Pct_pop_Two or more races, not Hispanic or Latino` NA
## t value
## (Intercept) -0.067
## estimate 5.881
## `Income to poverty ratio` -0.196
## `Pct_pop_Hispanic or Latino of any race` 0.388
## `Pct_pop_White alone, not Hispanic or Latino` 0.441
## `Pct_pop_Black or African American alone, not Hispanic or Latino` 0.410
## `Pct_pop_Native American alone, not Hispanic or Latino` -0.288
## `Pct_pop_Asian alone, not Hispanic or Latino` 0.013
## `Pct_pop_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino` 0.303
## `Pct_pop_Other race alone, not Hispanic or Latino` -0.432
## `Pct_pop_Two or more races, not Hispanic or Latino` NA
## Pr(>|t|)
## (Intercept) 0.947
## estimate 4.69e-09
## `Income to poverty ratio` 0.844
## `Pct_pop_Hispanic or Latino of any race` 0.698
## `Pct_pop_White alone, not Hispanic or Latino` 0.659
## `Pct_pop_Black or African American alone, not Hispanic or Latino` 0.682
## `Pct_pop_Native American alone, not Hispanic or Latino` 0.773
## `Pct_pop_Asian alone, not Hispanic or Latino` 0.990
## `Pct_pop_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino` 0.762
## `Pct_pop_Other race alone, not Hispanic or Latino` 0.666
## `Pct_pop_Two or more races, not Hispanic or Latino` NA
##
## (Intercept)
## estimate ***
## `Income to poverty ratio`
## `Pct_pop_Hispanic or Latino of any race`
## `Pct_pop_White alone, not Hispanic or Latino`
## `Pct_pop_Black or African American alone, not Hispanic or Latino`
## `Pct_pop_Native American alone, not Hispanic or Latino`
## `Pct_pop_Asian alone, not Hispanic or Latino`
## `Pct_pop_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino`
## `Pct_pop_Other race alone, not Hispanic or Latino`
## `Pct_pop_Two or more races, not Hispanic or Latino`
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.684 on 2233 degrees of freedom
## (84 observations deleted due to missingness)
## Multiple R-squared: 0.09941, Adjusted R-squared: 0.09578
## F-statistic: 27.39 on 9 and 2233 DF, p-value: < 2.2e-16
As shown in the last linear model, the only variable that is a statistically significant predictor of the number of voting poll sites in a census tract is its population. The distribution of race and the income to poverty ratio of census tracts all do not appear to affect the number of poll sites found in that neighborhood.